Presentation for sc-RNA sequencing

Author

Lucas Rincón de la Rosa

Published

February 5, 2024

This notebook covers the basics of single-cell RNA sequencing (scRNA-seq) data processing, imputation and analysis methods. In Section 1 I include a brief explanation of the problem we are dealing with, along with similar or related work done to date. In Section 2 I explain how to work with this type of data in Python, and the last section covers the main steps: processing the data (normalizing and log-transforming it), imputing missing values, and analyzing the results using clustering algorithms.

1 Introduction

Single-cell RNA sequencing data is obtained, as its name says, at the cell level. Each cell (observation) has thousands of genes (features) in its transcriptomic background. Depending on the cells and their function, some genes will be expressed and others not, which ultimately comes down to having positive counts for a gene or none. In the end, the data we are dealing with is a matrix with n observations as rows and m features as columns, where each entry represents the counts of a particular gene in a particular cell.

The main problem we are dealing with is missing values: the entries in the count matrix that are zero. Although the vast majority of genes are normally not expressed in a cell, the truth is that when obtaining the transcriptomic background many values are lost, so some genes that may actually be expressed are counted as zero. We can therefore differentiate between two types of zeros in the matrix:

  • Biological zeros: natural zeros, that is, genes that are truly not expressed in the cell.

  • Technical zeros: expressed genes whose count has been lost in the acquisition process.

This motivates coming up with a way to fill (impute, in scRNA vocabulary) the technical zeros while preserving the biological ones. This poses two challenges, since we want to fill missing values without knowing which zeros belong to which class.
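
As a purely synthetic illustration (hypothetical numbers, not the real dataset), we can mimic how technical zeros arise by randomly dropping entries of a small count matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" expression: genes with lam=0 are biologically silent
true_counts = rng.poisson(lam=[[5, 0, 3, 0],
                               [0, 4, 2, 1],
                               [6, 0, 0, 2]])

# Simulate dropout: each entry is lost with some probability
dropout = rng.random(true_counts.shape) < 0.3
observed = np.where(dropout, 0, true_counts)

biological_zeros = (true_counts == 0)
technical_zeros = (observed == 0) & (true_counts > 0)

print("Observed matrix:\n", observed)
print("Technical zeros:", technical_zeros.sum(),
      "| Biological zeros:", biological_zeros.sum())
```

In the observed matrix both kinds of zeros look identical, which is exactly why imputation is hard: only the simulation knows which zeros were technical.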

Main imports and project structure

First of all, we set the path to our main folder and do classical imports.

Code
import os
os.chdir('/media/user/T7 Touch/PhD_Lucas/rna_sequencing/')

import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import torch
import matplotlib.pyplot as plt

sc.settings.verbosity = 0 # errors (0), warnings (1), info (2), hints (3)

To avoid any misunderstanding when working with the directories, the tree structure of the project is printed below. Note that the present notebook is inside the notebooks folder, whereas the data we will be working on is in data/external.

Code
!tree -d -L 3
.
├── data
│   ├── external
│   │   ├── pbmc3k
│   │   └── pbmc68k
│   └── internal
├── experiments
├── figures
├── models
├── notebooks
└── src
    └── __pycache__

11 directories

2 Reading data as AnnData object

The main library that we will be using is scanpy, which contains multiple functions to read scRNA datasets as well as to process and visualize them.

All the experiments will be done using the PBMC-68k dataset. It is encoded in .mtx format and can be loaded as an AnnData object using the .read_10x_mtx function. The function takes the path of the folder containing the .mtx file and the naming scheme we want for the genes (variables): by default it uses gene symbols, but we can also call it with ‘gene_ids’.

Code
data_path = 'data/external/pbmc68k/'  # the directory with the `.mtx` file

adataCopy = sc.read_10x_mtx(
    path = data_path,            # the directory with the `.mtx` file
    var_names = 'gene_symbols',  # use gene symbols for the variable names (variables-axis index)
    cache = False)    

By calling the AnnData object we can see that we have 5898240 cells (observations) and 32738 genes (variables). In this case, all the variables have a unique name (identifier), but in case we get a warning, we can call .var_names_make_unique() to make all the variable names unique.

Code
adataCopy
AnnData object with n_obs × n_vars = 5898240 × 32738
    var: 'gene_ids'

The names of cells and genes can be accessed by calling .obs_names and .var_names respectively.

2.1 Adding annotations

We also have a supplementary file containing the annotation for ~68k cells out of the ~5.9M, where each cell is labeled as one of 11 groups. To add the cell type to our data, we can read the .tsv file and, since both the annotations and .obs are dataframes, merge them by keys.

Code
anno = pd.read_csv('data/external/pbmc68k/annotations.tsv', sep='\t')
anno.head()
TSNE.1 TSNE.2 barcodes celltype
0 7.565540 0.441370 AAACATACACCCAA-1 CD8+ Cytotoxic T
1 2.552626 -25.786672 AAACATACCCCTCA-1 CD8+/CD45RA+ Naive Cytotoxic
2 -5.771831 11.830846 AAACATACCGGAGA-1 CD4+/CD45RO+ Memory
3 1.762556 25.979346 AAACATACTAACCG-1 CD19+ B
4 -16.793856 -16.589970 AAACATACTCTTCA-1 CD4+/CD25 T Reg
Code
print("The length of the annotations file is:", len(anno))
The length of the annotations file is: 68579

We can see that the ‘barcodes’ column corresponds to the cell names. In order not to modify the original dataset (in case we want to manipulate it later) we make a copy of the AnnData object. We add a new column to the cells dataframe (obs) with the values of the merged dataframe, and lastly we drop all the cells that have NaN values in the celltype column.

Code
adata = adataCopy.copy()
inner = pd.merge(adata.obs, anno, left_index=True, right_on='barcodes', how='left')
adata.obs['celltype'] = inner['celltype'].values
idx = np.where(adata.obs['celltype'].notna())[0]
adata = adata[idx, :]
Code
name = 'models/adata.h5ad'
adata.write(name)

print("AnnData object saved in:", name)
AnnData object saved in: models/adata.h5ad

3 Data preprocessing

When working with scRNA data, some practices are commonly applied before the analysis (Luecken and Theis 2019). This preprocessing step has been shown to help not only computationally, but also to give better results when classifying the cells.

Luecken, Malte D, and Fabian J Theis. 2019. “Current Best Practices in Single‐cell RNA‐seq Analysis: A Tutorial.” Molecular Systems Biology 15 (6): e8746. https://doi.org/https://doi.org/10.15252/msb.20188746.

Some commonly used operations are:

  • Quality control: this ensures that low-quality cells are disregarded from the dataset for the analysis. They usually correspond to dying cells or cells with a broken membrane, which can cause misinterpretation of the counts.

  • Filtering: this can be applied to both cells and genes. In scRNA-seq vocabulary, it strictly means getting rid of those cells (genes) that have fewer than a certain number of genes (cells, respectively).

  • Normalization per cell: typically the total number of counts per cell varies a lot (even by orders of magnitude), so it is quite common to choose a fixed number as the total count per cell. For usual imputation methods and clustering analysis, 1e6 is a common total count (counts per million). Nevertheless, we will also be using the OT distance, which works with probability distributions, and thus the total count should be set to 1.

  • Log-transform: typically \(\log(x+1)\) is used. The one is added to avoid null values and to keep the zeros untouched. This helps reduce the variance of the data. Nevertheless, it changes the probability distribution, so it should only be used in analyses where the OT distance does not take part.

  • Sampling: since we have a rather large dataset and computations are costly, we will also reduce its size. Cells will be reduced to a certain percentage, keeping the same distribution as the original dataset, and the number of genes will be reduced by keeping only the most variable features. In this case, we will keep around 10% of cells and reduce the genes to approximately 1k.
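
The normalization and log-transform steps above can be sketched on a toy matrix with plain NumPy, mirroring what sc.pp.normalize_total and sc.pp.log1p do later in this notebook:

```python
import numpy as np

X = np.array([[10., 0., 30.],
              [ 2., 2.,  0.]])

# Normalize each cell (row) to a fixed total; target_sum=1 yields a
# probability distribution per cell, as needed for the OT distance
target_sum = 1.0
X_norm = X / X.sum(axis=1, keepdims=True) * target_sum

# Shifted logarithm: log(x + 1) keeps zeros at zero and shrinks variance
X_log = np.log1p(X_norm)

print(X_norm)   # rows sum to 1
print(X_log)    # zeros stay zero
```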

3.1 Quality Control (QC)

The first step in quality control is to remove low-quality reads from the dataset. When a cell has a low number of detected genes, a low count depth and a high fraction of mitochondrial counts, it might have a broken membrane, which can indicate a dying cell. As these cells are usually not the main target of our analysis and might distort the downstream analysis, we remove them during quality control. To identify them, we define cell quality control (QC) thresholds. Cell QC is typically performed on the following three covariates:

  • The number of counts per barcode (count depth)

  • The number of genes per barcode

  • The fraction of counts from mitochondrial genes per barcode

It should be clear that a cell having a high fraction of mitochondrial counts is not necessarily a dying cell with a broken membrane, but it may be one. That is why we look at the three covariates named above. We don’t want to end up disregarding more cells than we should, which is why these quality control metrics should be applied with reasonable thresholds.

Code
adata = sc.read_h5ad("models/adata.h5ad")

# Include mitochondrial genes (usually named as MT-)
adata.var["mt"] = adata.var_names.str.startswith("MT-")

# Compute QC metrics for all cells, including the mitochondrial genes
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=[20], log1p=True, inplace=True)

fig, [ax1, ax2] = plt.subplots(nrows=1, ncols=2, figsize=(11,5))

ax1.hist(adata.obs['total_counts'], 100)
ax1.set_title('Total counts per cell')
ax1.set_xlabel('total_counts')
ax1.set_ylabel('frequency')

sub = ax2.scatter(adata.obs['total_counts'], adata.obs['n_genes_by_counts'], s=0.3, c='tab:blue', label='8%')
fil = adata[adata.obs['pct_counts_mt'] >= 8]
ax2.scatter(fil.obs['total_counts'], fil.obs['n_genes_by_counts'], s=1.5, c='orange', label='>8%')
ax2.set_title('Expressed genes vs Total counts')
ax2.set_xlabel('total_counts')
ax2.set_ylabel('n_genes_by_counts')
plt.legend(loc="lower right")

plt.subplots_adjust(wspace=0.3)  # Adjust the width space

plt.show()

Figure 1: Quality control analysis

As we can see in Figure 1, the cells with a higher fraction of counts from mitochondrial genes are also the ones with fewer counts and fewer expressed genes. This might well reflect cells with broken membranes whose cytoplasmic mRNA has leaked out, so that only the mitochondrial mRNA is still present; we will therefore exclude them from our analysis.

To define cells as outliers, we can use the Median Absolute Deviation (MAD), defined as follows: \[\mathrm{MAD}=\operatorname{median}\left(\left|X_i-\operatorname{median}(X)\right|\right)\]

In this case, we will flag cells as outliers if they differ by more than 5 MADs (quite a permissive filter), and also those cells with more than 8% mitochondrial counts.

Code
from scipy.stats import median_abs_deviation

def is_outlier(adata, metric: str, nmads: int):
    M = adata.obs[metric]
    outlier = (M < np.median(M) - nmads * median_abs_deviation(M)) | (
        np.median(M) + nmads * median_abs_deviation(M) < M
    )
    return outlier

adata.obs["outlier"] = (
    is_outlier(adata, "log1p_total_counts", 5)
    | is_outlier(adata, "log1p_n_genes_by_counts", 5)
    | is_outlier(adata, "pct_counts_in_top_20_genes", 5)
)
print('Cell outliers by MADs:')
print(adata.obs.outlier.value_counts()[True])

adata.obs["mt_outlier"] = is_outlier(adata, "pct_counts_mt", 3) | (
    adata.obs["pct_counts_mt"] > 8
)
print('\nCell outliers by pct of mitochondrial genes:')
print(adata.obs.mt_outlier.value_counts()[True])

print(f"\nTotal number of cells before filtering: {adata.n_obs}")
adata = adata[(~adata.obs.outlier) & (~adata.obs.mt_outlier)].copy()

print(f"Number of cells after filtering of low quality cells: {adata.n_obs}")
Cell outliers by MADs:
2005

Cell outliers by pct of mitochondrial genes:
4397

Total number of cells before filtering: 68579
Number of cells after filtering of low quality cells: 62511
Code
fig, [ax1, ax2] = plt.subplots(nrows=1, ncols=2, figsize=(11,5))

ax1.hist(adata.obs['total_counts'], 100)
ax1.set_title('Total counts per cell')
ax1.set_xlabel('total_counts')
ax1.set_ylabel('frequency')

sub = ax2.scatter(adata.obs['total_counts'], adata.obs['n_genes_by_counts'], s=0.3, c=adata.obs['pct_counts_mt'])
ax2.set_title('Expressed genes vs Total counts')
ax2.set_xlabel('total_counts')
ax2.set_ylabel('n_genes_by_counts')
plt.colorbar(sub)

plt.subplots_adjust(wspace=0.3)  # Adjust the width space

plt.show()

Figure 2: Quality control analysis after filtering by MAD and mitochondrial counts

As one can see in Figure 2, the total counts per cell are now more centered (there is less variance) and all the cells have less than 3% mitochondrial counts. This makes the dataset much more stable and consistent for the analysis.

Code
name = 'models/adataQC.h5ad'
adata.write(name)

print("AnnData object saved in:", name)
AnnData object saved in: models/adataQC.h5ad

3.2 Filtering + normalization + log-transform

Code
adata = sc.read_h5ad("models/adataQC.h5ad")
adataFilter = adata.copy()

cells, genes = 6, 400
sc.pp.filter_cells(adataFilter, min_genes=genes)
sc.pp.filter_genes(adataFilter, min_cells=cells)

# sc.pp.calculate_qc_metrics(adataFilter, percent_top=None, log1p=False, inplace=True)
sc.pp.normalize_total(adataFilter, target_sum=1)
sc.pp.log1p(adataFilter)

diff = np.array(adata.shape) - np.array(adataFilter.shape)
print(f"Filtered out {diff[0]} cells that have less than {genes} genes expressed")
print(f"Filtered out {diff[1]} genes that are detected in less than {cells} cells")
print("\nNew size of dataset:", adataFilter.shape)
Filtered out 7596 cells that have less than 400 genes expressed
Filtered out 16899 genes that are detected in less than 6 cells

New size of dataset: (54915, 15839)

As we can see, many cells and genes are not sufficiently expressed, so they act more like noise than relevant information in our analysis. Also, in Figure 3 we can see that after the log-transform the data is better distributed: the deviation between counts is reduced and they are almost centered at 1, so we can use them as probability distributions.

Code
fig, [ax1, ax2] = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(10, 5))

ax1.set_title("Total counts")
ax1.hist(adataFilter.obs["total_counts"], bins=100)

ax2.set_title("Shifted logarithm")
ax2.hist(adataFilter.X.sum(1), bins=100)
plt.show()

Figure 3: Total counts before and after normalization
Code
name = 'models/adataFilter.h5ad'
adataFilter.write(name)

print("AnnData object saved in:", name)
AnnData object saved in: models/adataFilter.h5ad

3.3 Sampling

Gene sampling: to select the genes that vary the most, we can either compute the standard deviation (std) or use the sc.pp.highly_variable_genes function. The latter creates a boolean variable on the genes indicating whether they are among the most variable ones (with respect to a threshold), using a different metric than the std.

Code
adata = sc.read_h5ad("models/adataFilter.h5ad")
adataFeat = adata.copy()
n_genes = 3000
sc.pp.highly_variable_genes(adataFeat, n_top_genes=n_genes)
adataFeat = adataFeat[:,adataFeat.var['highly_variable']==True]

print(f'{n_genes} most variable genes selected as features')
print("New size of dataset:", adataFeat.shape)

name = 'models/adataFeat.h5ad'
adataFeat.write(name)

print("\nAnnData object saved in:", name)
3000 most variable genes selected as features
New size of dataset: (54915, 3000)

AnnData object saved in: models/adataFeat.h5ad
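
For completeness, a minimal sketch of the std-based alternative mentioned above (toy matrix with hypothetical values; with a sparse adata.X one would call .toarray() first):

```python
import numpy as np

# Toy count matrix: 5 cells x 4 genes (hypothetical values)
X = np.array([[1., 5., 0., 2.],
              [0., 9., 0., 2.],
              [2., 1., 0., 2.],
              [0., 7., 1., 2.],
              [1., 3., 0., 2.]])

n_top = 2
stds = X.std(axis=0)                           # per-gene standard deviation
top = np.sort(np.argsort(stds)[::-1][:n_top])  # top genes, original order kept
X_sel = X[:, top]

print("Selected gene indices:", top)
print("New shape:", X_sel.shape)
```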

Cell sampling: we start by sampling the cells down to 10% of their number, keeping the same distribution. Recall that this is just to speed up computations (a convenient approximation to start with); it is recommended to use all the observations for a formal analysis.

Code
import scipy.stats as stats

adata = sc.read_h5ad("models/adataFilter.h5ad")
cells_names = adata.obs_names
class_labels = adata.obs['celltype']

# Stratify the data
strata = {}
for label in class_labels.unique():
    strata[label] = cells_names[class_labels == label]

# Calculate sample sizes
strata_sample_sizes = {
    label: int(0.1 * len(stratum)) for label, stratum in strata.items()
}

# Perform stratified sampling
sample_names = np.array([])
for label, stratum in strata.items():
    sample_size = strata_sample_sizes[label]
    sample_names = np.concatenate((sample_names, np.random.choice(stratum, sample_size, replace=False)))

idxs = np.where(np.isin(adata.obs_names, sample_names))[0]
adataSample = adata[idxs]

diff = np.array(adata.shape) - np.array(adataSample.shape)
print(f"Sampled {adataSample.shape[0]} cells from a total of {adata.shape[0]}")
print('\nNew size of dataset:', adataSample.shape)

By plotting the cell type distribution in both the original dataset and the cell-filtered one (the filtering of genes doesn’t change anything here), we can see in Figure 4 that we are dealing with an unbalanced dataset, where the two or three most represented classes make up about 70% of the dataset and the four least represented don’t even add up to 5%. That is why we should come up with a way to make the dataset more balanced between classes, so that the clustering takes all cell types into account.

Code
adata = sc.read_h5ad("models/adata.h5ad")
adataFeat = sc.read_h5ad("models/adataFeat.h5ad")

dic = {"Original": adata,
       "Filtered": adataFeat}

i=0
bar_width = round(0.9/len(dic), 2)
r = np.arange(11)
fig, ax = plt.subplots(figsize=(6,5))
for key,data in dic.items():
    values = data.obs['celltype'].value_counts(normalize=True) * 100
    labels = values.index

    idx = [x + i*bar_width for x in r]
    i += 1
    ax.bar(idx, values, width=bar_width, label=key)

# Customize the plot
plt.xlabel('Categories')
plt.ylabel('Percentage')
plt.xticks([r + bar_width/2 for r in range(len(labels))], labels, rotation=45, ha='right')
plt.title('Comparison of cell type distributions')
plt.legend()

plt.show()

Figure 4: Cell type distribution for the original dataset and the filtered by cells one

A usual approach is to under-sample the most represented classes, that is, to sample the same number of observations for all classes, taking the least represented one as reference. In this case we will keep working with the filtered dataset and sample it at different sizes per class, say 15, 30, 45, 60 and 74, since the least represented class has around 75 cells.

Code
adataFeat = sc.read_h5ad("models/adataFeat.h5ad")
adata = adataFeat.copy()

sizes = [15, 30, 45, 60, 74]

samples = []
celltypes = adata.obs['celltype']
for size in sizes:

    idxs = []
    for cat in celltypes.unique():
        aux = celltypes[celltypes == cat].sample(size)
        idxs += list(aux.index.values)
    
    sample = adata[idxs]
    samples.append(sample)
Code
dic = {}
for i,data in zip(sizes,samples):
    dic[str(i)] = data

i=0
bar_width = round(0.9/len(dic), 2)
r = np.arange(11)
fig, ax = plt.subplots(figsize=(10,5))
for key,data in dic.items():
    values = data.obs['celltype'].value_counts()
    labels = values.index

    idx = [x + i*bar_width for x in r]
    i += 1
    ax.bar(idx, values, width=bar_width, label=key)

# Customize the plot
plt.xlabel('Categories')
plt.ylabel('Counts')
plt.xticks([r + 2*bar_width for r in range(len(labels))], labels, rotation=45, ha='right')
plt.title('Comparison of cell type distributions for balanced datasets')
plt.legend()

plt.show()

Figure 5: Cell type distribution for the balanced dataset using different dataset sizes

In Figure 5 one can see that we have obtained 5 different balanced datasets.

Code
name = 'models/adataSample.h5ad'
adataSample.write(name)

print("AnnData object saved in:", name)
AnnData object saved in: models/adataSample.h5ad

3.4 Genes distribution

We can do something similar to the barplot of the cell type distribution, but this time with the most expressed genes. Needless to say, we cannot plot all genes, since we are dealing with 3k for the filtered dataset and over 32k for the original one.

Code
adata = sc.read_h5ad("models/adata.h5ad")
adataFilter = sc.read_h5ad("models/adataFilter.h5ad")
adataHv = sc.read_h5ad("models/adataHv.h5ad")
adataStd = sc.read_h5ad("models/adataStd.h5ad")

dic = {"Original": adata,
       "Filtered": adataFilter,
       "Highly variable": adataHv,
       "Std": adataStd}

for key,data in dic.items():
    sc.pl.highest_expr_genes(
        data, n_top=30, show=False, save=f'{key}.png'
    )
Code
adata = sc.read_h5ad("models/adata.h5ad")
adataFilter = sc.read_h5ad("models/adataFilter.h5ad")
adataHv = sc.read_h5ad("models/adataHv.h5ad")
adataStd = sc.read_h5ad("models/adataStd.h5ad")

dic = {"Original": adata,
       "Filtered": adataFilter,
       "Std": adataStd}

i = 0
bar_width = 0.2
n_genes = 20
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,10))
for key,data in dic.items():
    sc.pp.calculate_qc_metrics(data, inplace=True)
    data.var['percentage'] = data.var['total_counts'] / data.var['total_counts'].sum() * 100

labels = adata.var['percentage'].sort_values(ascending=False)[:n_genes].index
labels1 = labels[::-1]
r = np.arange(len(labels))
for key,data in dic.items():
    values = data.var['percentage'].loc[labels1]
    idx = [x + i*bar_width for x in r]
    i += 1
    ax[0].barh(idx, values, height=bar_width, label=key)

# Customize the plot
ax[0].set_xlabel(f'Percentage')
ax[0].set_yticks([r + 1.5*bar_width for r in range(len(labels))], labels1)
ax[0].set_xlim(0,10)
ax[0].legend(loc='lower right')

sc.pp.calculate_qc_metrics(adataHv, inplace=True)
adataHv.var['percentage'] = adataHv.var['total_counts'] / adataHv.var['total_counts'].sum() * 100
values = adataHv.var['percentage'].sort_values(ascending=False)[:n_genes]
values = values.sort_values()
labels2 = values.index
pos = np.arange(len(values))  # bar positions for the right panel
ax[1].barh(pos, values, height=0.5, label='Highly variable', color='salmon')

# Customize the plot
ax[1].set_xlabel('Percentage')
ax[1].set_yticks(pos, labels2, ha='right')
ax[1].legend(loc='lower right')

fig.suptitle('Comparison of highest expressed genes', y=0.95, fontsize=17)
Text(0.5, 0.95, 'Comparison of highest expressed genes')

4 Distance metrics

In this section we will compute distance matrices on the count data. The content is based on (Huizing, Peyré, and Cantini 2022). We will try to capture cell-to-cell similarity using different metrics, including the OT distance. To do so, we compute the distance matrix between cells using their transcriptomic vectors, so that the matrix has dimension n_cells x n_cells.

First we get the names of the cell type classification (11 different labels) and sort the cells by label so that, at visualization time, each cluster is contiguous.

Code
adata = sc.read_h5ad('models/adataSample.h5ad')  # load processed AnnData
clusters = adata.obs['celltype']                 # get the cell type df
idx = np.argsort(clusters.values)                # idx of sorted array

print("The 11 labels for cell type classification are:\n")
print(np.unique(clusters))
The 11 labels for cell type classification are:

['CD14+ Monocyte' 'CD19+ B' 'CD34+' 'CD4+ T Helper2' 'CD4+/CD25 T Reg'
 'CD4+/CD45RA+/CD25- Naive T' 'CD4+/CD45RO+ Memory' 'CD56+ NK'
 'CD8+ Cytotoxic T' 'CD8+/CD45RA+ Naive Cytotoxic' 'Dendritic']
Code
idxs = []
for sample in samples:
    clusters = sample.obs['celltype']                 # get the cell type df
    idxs.append(np.argsort(clusters.values))               # idx of sorted array

4.1 Conventional metrics

We first compute the distance matrix using four conventional metrics: Euclidean, Manhattan, Cosine and Pearson. We do it for each of the balanced datasets obtained above so that we can compare their behaviour across sizes.

Code
from scipy.spatial.distance import cdist

metrics = ['euclidean', 'cityblock', 'cosine', 'correlation']

adataList = [adata.X.toarray() for adata in samples]

matrices = [[] for _ in adataList]  # matrices[i][j] = metric j on dataset i
for metric in metrics:
    for i,data in enumerate(adataList):
        print(f"Computing {metric} distance for dataset {i}")
        A = cdist(data, data, metric)
        A /= A.max()

        matrices[i].append(A)
Code
fig, axes = plt.subplots(
    nrows=5, ncols=5,
    # sharex=True, sharey=True,
    figsize=(20,25),
    width_ratios = [1, 1, 1, 1, 0.1]
)

for i in range(len(matrices)):
    for j in range(len(metrics)):
    
        im = axes[i, j].imshow(matrices[i][j][idxs[i]][:,idxs[i]])
    
    fig.colorbar(im, cax=axes[i, 4])
    axes[i,0].set_ylabel(f'Size {sizes[i]}', fontsize=20)

for j in range(len(metrics)):
    axes[0, j].set_title(metrics[j].title(), fontsize=20)

plt.tight_layout()
plt.show()

Figure 6: Comparative of conventional metrics applied to different dataset sizes

In Figure 6 we can see that the cell-to-cell metrics can help cluster the data. In this case, all datasets are better separated when using the Cosine or Pearson correlation distance. Also, the bigger the dataset, the smaller the noise in the matrices, yielding smoother groups.

4.2 OT distance

When computing the OT distance, (Huizing, Peyré, and Cantini 2022) obtained the best results for real scRNA data using the cosine distance as the ground cost matrix of the OT problem.

Huizing, Geert-Jan, Gabriel Peyré, and Laura Cantini. 2022. Optimal transport improves cell–cell similarity inference in single-cell omics data.” Bioinformatics 38 (8): 2169–77. https://doi.org/10.1093/bioinformatics/btac084.
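
To make the idea concrete, here is a minimal NumPy sketch of the entropic (Sinkhorn) OT distance between two cells viewed as probability distributions over genes. This is a simplified stand-in for what otscomics computes below, with a hypothetical toy ground cost instead of the real cosine cost matrix:

```python
import numpy as np

def sinkhorn_distance(a, b, C, eps=0.1, n_iter=500):
    """Entropically regularized OT cost <P, C> between histograms a and b."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):              # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]      # optimal transport plan
    return float((P * C).sum())

# Two toy "cells" as probability distributions over 3 genes
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.1, 0.2, 0.7])
C = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])             # hypothetical ground cost between genes

print(sinkhorn_distance(a, b, C))
```

Unlike entry-wise metrics, the cost matrix lets mass move between related genes, which is the advantage OT brings to transcriptomic comparisons.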
Code
import otscomics

adataList = [adata.X.toarray().astype(np.double) for adata in samples]
adataList = [data + 1e-9 for data in adataList]
adataList = [(data.T/data.sum(1)).T for data in adataList]

matricesOT  = []
for i,data in enumerate(adataList):
    print(f"Computing OT distance for dataset {i}")

    C = otscomics.cost_matrix(data.T, 'cosine')
    A_ot, errors = otscomics.OT_distance_matrix(
        data=data.T, cost=C, eps=.1,
        dtype=torch.double, device='cuda'
    )

    matricesOT.append(A_ot)
Code
fig, axes = plt.subplots(
    nrows=1, ncols=6,
    figsize=(20,5),
    width_ratios = [1, 1, 1, 1, 1, 0.1]  # last axis holds the colorbar
)

for i in range(len(matricesOT)):

    im = axes[i].imshow(matricesOT[i][idxs[i]][:,idxs[i]])
    axes[i].set_title(f'Size {sizes[i]}', fontsize=20)

fig.colorbar(im, cax=axes[-1])

plt.tight_layout()
plt.show()

Figure 7: Comparative of OT metrics applied to different dataset sizes

5 Imputation methods

Once we have filtered and processed our data, we can move on to the imputation step, where we try different methods (ALRA, MAGIC and Sinkhorn) to recover the technical zeros in the sparse matrix. We will compare how well the data is grouped into clusters depending on the imputation method used, as well as for the original raw data. These clusters will be compared with the true labels, where each cell is classified as one of eleven cell types. This type of analysis is based on the review (Hou et al. 2020), which compares a good number of imputation methods.

Hou, Wenpin, Zhicheng Ji, Hongkai Ji, and Stephanie C. Hicks. 2020. A systematic evaluation of single-cell RNA-sequencing imputation methods.” Genome Biology 21 (1): 218. https://doi.org/10.1186/s13059-020-02132-x.
Code
adataHv = sc.read_h5ad('models/adataHv.h5ad') 
adataStd = sc.read_h5ad('models/adataStd.h5ad')

dataList = {"Hv": adataHv, "Std": adataStd}

5.1 ALRA

This method was developed in (Linderman, Zhao, and Kluger 2018). It consists in taking a low-rank approximation of the count matrix to impute all missing values (technical + biological zeros) and then recovering the true (biological) zeros using a hypothesis on the data distribution: they assume that the reconstructed values of the biological zeros of each gene are symmetrically distributed around 0.

Linderman, George C., Jun Zhao, and Yuval Kluger. 2018. “Zero-Preserving Imputation of scRNA-Seq Data Using Low-Rank Approximation.” bioRxiv. https://doi.org/10.1101/397588.

Figure 8: Overview of the ALRA imputation scheme. A Measured expression matrix contains technical zeros (in each block) and biological zeros (outside each block). B Low-rank approximation via SVD results in a matrix where every zero is completed with a non-zero value (i.e., biological zeros are not preserved). C The elements corresponding to biological zeros for each gene are symmetrically distributed around 0, so the magnitude of the most negative elements in each row is also approximately the magnitude of the most positive values assigned to biological zeros. Therefore, thresholding the values in each row restores almost all of the biological zeros in the imputed matrix (D).
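
A minimal NumPy sketch of the ALRA scheme in the figure (an illustrative simplification, not the src.ALRA implementation used below; the rank k is assumed given rather than chosen via choose_k):

```python
import numpy as np

def alra_sketch(A, k):
    """Rank-k completion followed by per-gene symmetric thresholding."""
    # 1) Low-rank approximation via truncated SVD (fills every zero)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]

    # 2) Values reconstructed for biological zeros are assumed symmetric
    #    around 0, so threshold each gene (column) at the magnitude of its
    #    most negative entry to restore those zeros
    thresh = np.abs(A_k.min(axis=0))
    A_imp = np.where(A_k < thresh, 0.0, A_k)

    # 3) Keep the originally observed (non-zero) counts untouched
    return np.where(A > 0, A, A_imp)

# Toy usage with hypothetical counts
A = np.array([[5., 0., 3.],
              [4., 0., 2.],
              [0., 7., 0.],
              [1., 6., 0.]])
print(alra_sketch(A, k=2))
```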
Code
from src.ALRA import choose_k, ALRA

for key,data in dataList.items():
    normalizedMatrix = data.X.toarray().astype(float)

    k = choose_k(normalizedMatrix)
    A_alra = ALRA(normalizedMatrix, k)

    adataALRA = ad.AnnData(A_alra)
    adataALRA.obs_names = data.obs_names
    adataALRA.obs['celltype'] = data.obs['celltype']
    adataALRA.var_names = data.var_names

    name = f'models/adataALRA{key}.h5ad'
    adataALRA.write(name)

    print("AnnData object saved in:", name)
AnnData object saved in: models/adataALRAHv.h5ad
AnnData object saved in: models/adataALRAStd.h5ad

5.2 MAGIC
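
MAGIC imputes values by smoothing each cell's expression over a graph of similar cells: a row-stochastic Markov matrix built from cell-to-cell affinities is raised to a power t and applied to the data, so dropped-out entries are filled in from a cell's neighbours. A minimal sketch of that diffusion step (toy affinity matrix, not the adaptive kernel the magic library actually builds):

```python
import numpy as np

def diffuse(X, A, t=3):
    """Smooth expression X (cells x genes) over cell affinities A (cells x cells)."""
    M = A / A.sum(axis=1, keepdims=True)   # row-stochastic Markov matrix
    return np.linalg.matrix_power(M, t) @ X

# Toy example: 3 cells x 2 genes; cells 0 and 1 are similar
X = np.array([[5., 0.],    # suppose gene 1 dropped out in cell 0
              [4., 2.],
              [0., 9.]])
A = np.array([[1., .9, .1],
              [.9, 1., .1],
              [.1, .1, 1.]])
print(diffuse(X, A))        # the zero in cell 0 is filled by its neighbours
```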

Code
import scprep
import magic

for key,data in dataList.items():
    normalizedMatrix = data.X.toarray().astype(float)

    magicOP = magic.MAGIC()
    A_magic = magicOP.fit_transform(normalizedMatrix, genes='all_genes')
    adataMAGIC = ad.AnnData(A_magic)

    adataMAGIC.obs_names = data.obs_names
    adataMAGIC.obs['celltype'] = data.obs['celltype']
    adataMAGIC.var_names = data.var_names

    name = f'models/adataMAGIC{key}.h5ad'
    adataMAGIC.write(name)

    print("\nAnnData object saved in:", name)
Calculating MAGIC...
  Running MAGIC on 5601 cells and 3000 genes.
  Calculating graph and diffusion operator...
    Calculating PCA...
    Calculated PCA in 3.21 seconds.
    Calculating KNN search...
    Calculated KNN search in 2.86 seconds.
    Calculating affinities...
    Calculated affinities in 2.83 seconds.
  Calculated graph and diffusion operator in 8.91 seconds.
  Calculating imputation...
  Calculated imputation in 0.99 seconds.
Calculated MAGIC in 9.93 seconds.

AnnData object saved in: models/adataMAGICHv.h5ad
Calculating MAGIC...
  Running MAGIC on 5601 cells and 3000 genes.
  Calculating graph and diffusion operator...
    Calculating PCA...
    Calculated PCA in 2.54 seconds.
    Calculating KNN search...
    Calculated KNN search in 3.16 seconds.
    Calculating affinities...
    Calculated affinities in 2.97 seconds.
  Calculated graph and diffusion operator in 8.67 seconds.
  Calculating imputation...
  Calculated imputation in 1.06 seconds.
Calculated MAGIC in 9.75 seconds.

AnnData object saved in: models/adataMAGICStd.h5ad

5.3 Sinkhorn

Code
from hyperimpute.plugins.imputers import Imputers

for key,data in dataList.items():
    normalizedMatrix = data.X.toarray().astype(float)

    normalizedMatrix[normalizedMatrix == 0] = np.nan

    plugin = Imputers().get("sinkhorn")
    df_sink = plugin.fit_transform(normalizedMatrix)
    normalizedMatrix = np.array(df_sink)

    adataSink = ad.AnnData(normalizedMatrix)
    adataSink.obs_names = data.obs_names
    adataSink.obs['celltype'] = data.obs['celltype']
    adataSink.var_names = data.var_names

    name = f'models/adataSink{key}.h5ad'
    adataSink.write(name)

    print("AnnData object saved in:", name)
AnnData object saved in: models/adataSinkHv.h5ad
AnnData object saved in: models/adataSinkStd.h5ad

6 Visualization

A preliminary step before analysing the data is to treat it so that the statistical computations make more sense. In this case, we know that scRNA sequencing suffers from the so-called ‘curse of dimensionality’. Even though we tend to think that the more features (genes), the better the classification of the observations, this is not always true. It often happens in this field that the number of variables is so high that many of them don’t actually add information, but rather noise.

Code
adataHv = sc.read_h5ad('models/adataHv.h5ad')
adataStd = sc.read_h5ad('models/adataStd.h5ad')

adataALRAHv = sc.read_h5ad('models/adataALRAHv.h5ad')
adataALRAStd = sc.read_h5ad('models/adataALRAStd.h5ad')

adataMAGICHv = sc.read_h5ad('models/adataMAGICHv.h5ad')
adataMAGICStd = sc.read_h5ad('models/adataMAGICStd.h5ad')

adataSinkHv = sc.read_h5ad('models/adataSinkHv.h5ad')
adataSinkStd = sc.read_h5ad('models/adataSinkStd.h5ad')

dataList = {"Raw Hv": adataHv, "Raw Std": adataStd,
            "ALRA Hv": adataALRAHv, "ALRA Std": adataALRAStd,
            "MAGIC Hv": adataMAGICHv, "MAGIC Std": adataMAGICStd,
            "Sink Hv": adataSinkHv, "Sink Std": adataSinkStd,
}

6.1 Dimensionality reduction

PCA creates a new set of uncorrelated variables, the so-called principal components (PCs), via an orthogonal transformation of the original dataset. The PCs are linear combinations of the original features and are ranked in decreasing order of variance, so the first PC captures the largest possible variance. PCs with the lowest variance are discarded to reduce the dimensionality of the data while losing as little information as possible.
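As a minimal illustration outside of scanpy (a sketch using scikit-learn on a synthetic count-like matrix, so the array below is just a placeholder for the real data), the explained variance ratios of a fitted PCA come out already sorted in decreasing order:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic count-like matrix: 200 cells x 50 genes
X = rng.poisson(lam=2.0, size=(200, 50)).astype(float)

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

# Ratios are sorted in decreasing order; their sum is the
# fraction of total variance kept by the 10 retained PCs
ratios = pca.explained_variance_ratio_
print(np.round(ratios, 3))
print("Variance kept with 10 PCs:", round(ratios.sum(), 3))
```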

In order to apply it to our data, we first perform the method via .tl.pca or .pp.pca (they are actually the same function) and then scatter plot it using .pl.pca. To plot it we only need to pass the AnnData object, since the function will look for the components saved in uns, obsm and varm. If we want the scatter plot to be colored, we can also specify which variable (or cell/gene) to take into account.

Code
for key,data in dataList.items():
    sc.pp.calculate_qc_metrics(data, percent_top=None, log1p=False, inplace=True)
    sc.tl.pca(data, svd_solver='arpack', n_comps=30)
    sc.pl.pca_variance_ratio(data, log=True, n_pcs=30, show=False, save=f'{key}.png')
Code
from PIL import Image

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(20,10))

path = '/media/user/T7 Touch/PhD_Lucas/rna_sequencing/figures'
figs = [i for i in os.listdir(path) if i.startswith('pca')]
for i in range(len(figs)):

    img = np.asarray(Image.open(f'{path}/{figs[i]}'))
    row, col = i%2, i//2
    im = axes[row, col].imshow(img)
    axes[row, col].axis('Off')

axes[0, 0].set_title("Raw", size='large')
axes[0, 1].set_title("ALRA", size='large')
axes[0, 2].set_title("MAGIC", size='large')
axes[0, 3].set_title("Sink", size='large')

plt.suptitle("Highly variable genes (up) and std (down)", fontsize=30)
plt.tight_layout()
plt.show()

6.2 Clustering

Once the dimensionality reduction has been performed, we can actually visualize the data in a 2D or 3D plot. Then, clustering methods can be applied to the reduced data in order to group the observations in the space of principal components. To do so, we can use the different clustering methods available in the scanpy module. Here we will show two of them, which make use of a graph structure on the data.

That is why we first need to compute the KNN graph on the lower-dimensional data using .pp.neighbors. Apart from the data, we also need to provide the number of principal components to use (how many dimensions out of the total we want to keep). This is an important choice, and it depends on the fraction of the variance that each principal component carries, which we can inspect with the .pl.pca_variance_ratio function.
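One simple heuristic for that choice (a sketch on synthetic data; the 90% threshold is just an example, not a fixed rule) is to keep the smallest number of PCs whose cumulative explained variance exceeds a target fraction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Placeholder data: 300 cells x 40 genes
X = rng.normal(size=(300, 40))

# Fit the full PCA and accumulate the explained variance ratios
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of PCs whose cumulative variance exceeds 90%
n_pcs = int(np.searchsorted(cumulative, 0.9) + 1)
print("Keep", n_pcs, "principal components")
```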

We can perform the Leiden algorithm using the .tl.leiden function, passing the reduced data with the KNN graph already computed and the resolution parameter (1 by default), which roughly controls the number of clusters. We can also pass neighbors_key in case we have computed multiple KNN graphs, and key_added in case we use different resolution parameters and want to compare them.

Code
from sklearn.cluster import KMeans

for key,data in dataList.items():

    matrix = data.X
    if type(matrix) != np.ndarray:
        matrix = matrix.toarray()

    kmeans = KMeans(n_clusters=11, random_state=0, n_init="auto").fit(matrix)
    data.obs['kmeans'] = kmeans.labels_
    data.obs['kmeans'] = data.obs['kmeans'].astype('string')
    print('KMeans performed on', key)
KMeans performed on Raw Hv
KMeans performed on Raw Std
KMeans performed on ALRA Hv
KMeans performed on ALRA Std
KMeans performed on MAGIC Hv
KMeans performed on MAGIC Std
KMeans performed on Sink Hv
KMeans performed on Sink Std
Code
for key,data in dataList.items():
    sc.pp.neighbors(data, n_pcs=20, use_rep='X')
    sc.tl.umap(data)
    print('UMAP performed on', key)
UMAP performed on Raw Hv
UMAP performed on Raw Std
UMAP performed on ALRA Hv
UMAP performed on ALRA Std
UMAP performed on MAGIC Hv
UMAP performed on MAGIC Std
UMAP performed on Sink Hv
UMAP performed on Sink Std
Code
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

for key,data in dataList.items():
    # Convert string labels to numerical labels
    label_encoder = LabelEncoder()
    true_labels = label_encoder.fit_transform(data.obs['celltype'])
    kmeans_labels = data.obs['kmeans'].astype(int)

    # Calculate Adjusted Rand Index (ARI)
    ari = adjusted_rand_score(true_labels, kmeans_labels)

    # Calculate Normalized Mutual Information (NMI)
    nmi = normalized_mutual_info_score(true_labels, kmeans_labels)

    print(key)
    print("Adjusted Rand Index (ARI):", np.round(ari, 3))
    print("Normalized Mutual Information (NMI):", np.round(nmi, 3),'\n')
Raw Hv
Adjusted Rand Index (ARI): 0.159
Normalized Mutual Information (NMI): 0.336 

Raw Std
Adjusted Rand Index (ARI): 0.164
Normalized Mutual Information (NMI): 0.339 

ALRA Hv
Adjusted Rand Index (ARI): 0.171
Normalized Mutual Information (NMI): 0.344 

ALRA Std
Adjusted Rand Index (ARI): 0.173
Normalized Mutual Information (NMI): 0.352 

MAGIC Hv
Adjusted Rand Index (ARI): 0.155
Normalized Mutual Information (NMI): 0.342 

MAGIC Std
Adjusted Rand Index (ARI): 0.171
Normalized Mutual Information (NMI): 0.355 

Sink Hv
Adjusted Rand Index (ARI): 0.0
Normalized Mutual Information (NMI): 0.006 

Sink Std
Adjusted Rand Index (ARI): 0.001
Normalized Mutual Information (NMI): 0.005 
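As a sanity check on how these metrics behave (a toy example with made-up labels, unrelated to the real cell types), both ARI and NMI are invariant to renaming the cluster ids, so a perfect clustering scores 1.0 even if the label names differ from the ground truth:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
# The exact same partition, but with the cluster ids renamed
permuted    = [2, 2, 2, 0, 0, 0, 1, 1, 1]

print(adjusted_rand_score(true_labels, permuted))           # 1.0
print(normalized_mutual_info_score(true_labels, permuted))  # 1.0
```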
Code
for key,data in dataList.items():
    sc.pl.umap(data,
           color=['celltype', 'kmeans'],
           title=['Ground Truth', key],
           legend_loc="right margin",
           wspace=0.5,
           show=False,
           save=f'{key}.png'
)
Code
fig, axes = plt.subplots(nrows=8, ncols=1, figsize=(40,20))

path = '/media/user/T7 Touch/PhD_Lucas/rna_sequencing/figures'
figs = [i for i in os.listdir(path) if i.startswith('umap')]
for i in range(len(figs)):

    img = np.asarray(Image.open(f'{path}/{figs[i]}'))
    im = axes[i].imshow(img)
    axes[i].axis('Off')

plt.tight_layout()
plt.show()